In Bayesian inference, one typically seeks to carry out the calculation of the posterior distribution of the parameters that make up a certain model from the available information and assumptions about how the distribution prior to the inference process (a priori distribution) would behave. The typical approach is to employ some MCMC method, such as gibbs or metropolis hasting; however, relatively recently more efficient approaches have emerged.
Laplace Integrated Nested Approach or INLA is a relatively recent method of fitting Bayesian models. The INLA approach aims to solve the computational difficulty of MCMC in data-intensive problems or complex models. In many applications, the posterior distribution sampling process using MCMC can take too long and is often not even feasible with existing computational resources.
First of all, we load all the needed packages to do this workshop and the data
library(kableExtra)
library(tidyverse)
library(INLA)
library(yardstick)
library(gt)
library(innovar)
library(spdep)
library(reshape2)
library(viridis)
db <- readRDS("db_excess_proc_dis.rds")
# Total count of deaths
db.tl <- readRDS("db_excess_proc_agr.rds") we proceed to do a temporal descriptive analysis, to check the evolution of the number of deaths we employ ggplot package.
db.tl %>%
ggplot()+
geom_line(aes(x=date,y=n),color="#aa3d01",lwd=0.5)+
geom_point(aes(x=date,y=n),color="#aa3d01",lwd=2.5,colour="#aa3d01",shape=21)+
ylab("Numero de muertes")+
xlab("")+
ggtitle("Evolution of the number of deaths")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5,size = 28),
strip.background =element_rect(fill="#850503"),
strip.text = element_text(color = "white",face = "bold",size = "5pts")) similarly, we can check the above in a spatial context, in order to do that we load a shapefile that contain all the geographic information of Peru and joint with our data :
#load shapefile
data("Peru")
db.sp.peru <- Peru %>%
group_by(reg,prov) %>%
summarise() %>%
left_join( db %>%
group_by(year,reg,prov) %>%
summarise(n=mean(n))) later we can graph the number of deaths using ggplot again
#plot
db.sp.peru %>%
ggplot()+
geom_sf(aes(fill=n))+
theme_bw() +
facet_wrap(vars(year)) +
scale_fill_viridis(name="Average number\n of deaths",direction = -1,option = "rocket")+
theme(plot.title = element_text(hjust = 0.5,size = 30),
strip.background =element_rect(fill="#aa3d01"),
strip.text = element_text(color = "white",face = "bold",size = 25)) To carry out the adjustment of models using the INLA approach we can use the inla() function, which has several parameters, some of the most important are :
data : An object typically of the class data.frame, data to adjust any model.
formula : Un objeto de la clase formula que especifica la ecuacion que pretendemos ajustar como por ejemplo inla(target ~ 1 + x1). En la formula podemos especificar efectos lineales (introduciendo el nombre de la variable) o no lineales (empleando f()).
verbose : A variable of the type boolean, which indicates if you want to show the convergence process in the console
peru.m0 <- inla(n ~ 1 + annus,
verbose = F,
data = db
)The parameters detailed above are the essential ones to execute the adjustment of a model using INLA. However, some extra parameters to consider are the following:
family: Class object character. This parameter is crucial, as it determines the distribution of the target variable, by default it is infamily = Gaussian.control.compute:Object of class list allows to specify the calculation of information criteria such asaic, dic,waic.peru.m1 <- inla(n ~ 1 + annus,
verbose = F,
data = db,
family = "Gaussian",
control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
control.predictor = list(link = 1),
)In addition, we can use more variables in the linear predictor or employ other assumption of the distribution of the target variable in order to obtain a better model. Given we are modelling a count process , we’ll use a negative binomial distribution as assumption and employ socio-economic and climatic variables that can explain the variability of the number of deaths in Peru.
peru.m2 <- inla(n ~ 1 + annus + temperature+ INEI_prop_aseg_2017 + prop_pobreza_2018 + INEI_Prop_NoAlumbradoElectri_2017 + INEI_prop_NoAbsAguVvn_int_2017,
verbose = F,
data = db,
family = "nbinomial",
control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
control.predictor = list(link = 1),
)Until now, we have only considered linear effects but with INLA we can also model non-linear effects that can control for temporal and spatial effects
when considering temporal effects, we are implicitly assuming that the time series can be decomposed as follows
\[\begin{align*} y_{t} = S_{t}+T_{t}+e_{t} \end{align*}\]
where :
\(S_{t}\) : Seasonality
\(T_{t}\) : Trend
\(e_{t}\) : white noise
To model this components in INLA, we can use :
* AR1 (1st order autoregressive process) : f(variable, model = "ar1")
* RW1 (1st order random walk) : f(variable, model = "rw1")
* RW2 (2nd order random walk) : f(variable, model = "rw2")
# Assuming an a priori Distribution for the years: "iid"
peru.m3 <- inla(n ~ 1 + annus + temperature+ INEI_prop_aseg_2017 + prop_pobreza_2018 + INEI_Prop_NoAlumbradoElectri_2017 + INEI_prop_NoAbsAguVvn_int_2017,
control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
control.predictor = list(link = 1),
verbose = F,
family = "nbinomial",
data = db
)
# Assuming an a priori Distribution for the weeks: "rw1"
peru.m4 <- inla(n ~ 1 + annus + f(week,model="rw1") + temperature+
INEI_prop_aseg_2017 + prop_pobreza_2018 + INEI_Prop_NoAlumbradoElectri_2017 + INEI_prop_NoAbsAguVvn_int_2017,
control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
control.predictor = list( link = 1),
verbose = F,
family = "nbinomial",
data = db
)
# both priors at same time
peru.m5 <- inla(n ~ 1 + f(annus,model="ar1")+ f(week,model="rw1")+ temperature+
INEI_prop_aseg_2017 + prop_pobreza_2018 + INEI_Prop_NoAlumbradoElectri_2017 + INEI_prop_NoAbsAguVvn_int_2017,
control.compute = list(dic=TRUE, waic=TRUE,cpo=TRUE),
control.predictor = list( link = 1),
verbose = F,
family = "nbinomial",
data = db
)the assumption about the decomposition of the time series can become more complex when considering the spatial dimension, so
\[\begin{align*} y_{t} = S_{t}+T_{t}+\nu_{t}+u_{t}+e_{t} \end{align*}\]
where:
\(\nu_{t}\) : structured effects
\(u_{t}\) : unstructured effects
The 2 new components \(\nu_ {t},u_{t}\) are intended to take into account in the prediction the spatial correlation present in the data. About :
The unstructured effects correspond to random effects that attempt to control unobserved characteristics of each area under study. To model them we can use f (area, model =" iid ")
Structured effects are random effects that explicitly take spatial structure into account. They can be modeled in INLA in various ways::
f(area, model = "besag")f(area, model = "besagproper")To include these priors as spatial effects, first we need to collect the geographic information of Peru expressed in a shapefile, at the provincial level. For which, in addition to calling the shapefile, by grouping the data at the provincial level and using the summarize function, we seek to condense the polygons of the shapefile to that level
data("Peru")
peru.prov<-Peru %>%
group_by(reg,prov) %>%
summarise()Subsequently, we create a neighborhood structure from the shapefile, and in turn, from said neighborhood structure, we create the spatial weight matrix
# Creando la estructura de vecinos mas cercanos
peru.adj <- poly2nb(peru.prov)
# Pesos espaciales
W.peru <- nb2mat(peru.adj, style = "W") Finally, we carry out the cleaning of non-existent provinces; as well as the creation of an id for each province, in order to identify each polygon. The latter represents an extra requirement to fit spatial models in INLA
db.peru.sp<- db %>%
group_by(prov) %>%
mutate(id.sp=cur_group_id())Now we are ready to fit a model which effects by province are correlated spatially :
peru.m6 <- inla(n ~ 1 + annus +
f(week,model="rw1") +
f(id.sp, model = "bym",
graph=W.peru) +
temperature + INEI_prop_aseg_2017 + prop_pobreza_2018 +
INEI_Prop_NoAlumbradoElectri_2017 + INEI_prop_NoAbsAguVvn_int_2017,
data = db.peru.sp,
control.compute = list(dic = TRUE,
waic = TRUE,
cpo = TRUE),
control.predictor = list(link = 1)
)it is worth mentioning that is easy to assum that these province effects are independent using f(area, model = "iid") :
peru.m7 <- inla(n ~ 1 + annus +
f(week,model="rw1") +
f(id.sp, model = "iid",
graph=W.peru) +
temperature + INEI_prop_aseg_2017 + prop_pobreza_2018 +
INEI_Prop_NoAlumbradoElectri_2017 + INEI_prop_NoAbsAguVvn_int_2017,
data = db.peru.sp,
control.compute = list(dic = TRUE,
waic = TRUE,
cpo = TRUE),
control.predictor = list(link = 1)
)On the other hand, we can check our estimates and plot them. So, firt we access to the linear effects fitted using InlaObject$summary.fixed
f.m3<-peru.m3$summary.fixed %>% mutate(Model="linear") %>% rownames_to_column("Variable")
f.m4<-peru.m4$summary.fixed %>% mutate(Model="linearRW(1)") %>% rownames_to_column("Variable")
f.m5<-peru.m5$summary.fixed %>% mutate(Model="AR(1)RW(1)") %>% rownames_to_column("Variable")
f.m6<-peru.m6$summary.fixed %>% mutate(Model="linearRW(1) with spatial effects ") %>% rownames_to_column("Variable")
f.m7<-peru.m6$summary.fixed %>% mutate(Model="linearRW(1) with iid effects ") %>% rownames_to_column("Variable")
fix.data<-rbind(f.m3,f.m4,f.m5,f.m6,f.m7)%>%
filter(!str_detect(Variable, 'annus'))and plot the coefficients
fix.data %>%
ggplot(aes(colour=Model)) +
geom_linerange(aes(x = Variable,
ymin = `0.025quant`,
ymax = `0.975quant`),
position = position_dodge(width = 1/2),
lwd=1) +
geom_pointrange(aes(x = Variable, y=mean,
ymin = `0.025quant`,
ymax = `0.975quant`
),position = position_dodge(width = 1/2), lwd = 1/2,shape=21, fill = "WHITE") +
scale_color_manual(values = c("#850503","#f55e2c","#df861d","#ffd16c","#f9ebac"))+
ggtitle("Fixed effects") +
geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
coord_flip() +
theme_bw()we can proceed to analyse our fitted models in the space . so we access to fitted values with InlaObject$$summary.fitted.values$mean
db.peru.sf<- Peru %>%
group_by(reg,prov) %>%
summarise() %>%
inner_join(db.peru.sp %>%
ungroup() %>%
mutate(fit=peru.m6$summary.fitted.values$mean),c("reg"="reg","prov"="prov")) %>%
group_by(reg,prov,year) %>%
summarise(n=mean(n),
fit=mean(fit))
map.true<-db.peru.sf %>%
filter(year==2019) %>%
ggplot()+
geom_sf(aes(fill=n))+
theme_bw() +
#facet_wrap(vars(year)) +
scale_fill_viridis(name="Average number\n of deaths",option="rocket",direction = -1)+
theme(plot.title = element_text(hjust = 0.5,size = 35),
strip.background =element_rect(fill="#aa3d01"),
strip.text = element_text(color = "white",face = "bold",size = 25))
# scale_fill_distiller(name="Average number\n of deaths",direction = -1)
map.fit<-db.peru.sf %>%
filter(year==2019) %>%
ggplot()+
geom_sf(aes(fill=fit))+
theme_bw() +
#facet_wrap(vars(year)) +
scale_fill_viridis(name="Average number\n of deaths",option="rocket",direction = -1)+
theme(plot.title = element_text(hjust = 0.5,size = 35),
strip.background =element_rect(fill="#aa3d01"),
strip.text = element_text(color = "white",face = "bold",size = 25))
# scale_fill_distiller(name="Average number\n of deaths",direction = -1)
cowplot::plot_grid(map.true,map.fit)In order to assess the models, we will calculate in-sample performance metrics.First, we collect the fitted values from the INLA object
fit.m.m3 <- peru.m3$summary.fitted.values$mean
fit.m.m4 <- peru.m4$summary.fitted.values$mean
fit.m.m5 <- peru.m5$summary.fitted.values$mean
fit.m.m6 <- peru.m6$summary.fitted.values$mean
fit.m.m7 <- peru.m7$summary.fitted.values$mean
n.peru <- db.peru.sp$n
datos <- list("linear"=fit.m.m3,"linearRW(1)"=fit.m.m4,"AR(1)RW(1)"=fit.m.m5,"spatial effects"=fit.m.m6,"iid effects"=fit.m.m7) %>%
as.data.frame() %>%
melt( variable.name = "modelo",
value.name = "fit") %>%
group_by(modelo) %>%
mutate(actual = n.peru,
date = db.peru.sp$date,
reg = db.peru.sp$reg,
prov = db.peru.sp$prov)then to facilitate the calculation of the performance metrics we transform the data to long format and then we use the yardstick package to calculate the next metrics :mae,mape,mpe,rmse,msd. We can calculate these metrics for all the dataset.
#Metrics to use
perform.metrics <- metric_set(mae,mase,smape,rmse,msd)
# Calculation of metrics in the forecast year
tbl.yrd.full <- datos %>%
group_by(modelo) %>%
perform.metrics(truth = actual,
estimate = fit)And per province
#Metrics to use
perform.metrics <- metric_set(mae,mase,smape,rmse,msd)
# Calculation of metrics in the forecast year
tbl.yrd.per <- datos %>%
group_by(modelo,reg,prov) %>%
perform.metrics(truth = actual,
estimate = fit)Finally, we use gt package to show in a customized table the results
# Results table
tbl.yrd.full %>%
pivot_wider(id_cols = modelo,
names_from = .metric,
values_from = .estimate) %>%
gt() %>%
tab_header(
title = md("in-sample accuracy metrics")#,
#subtitle ="out-of-sample accuracy metrics"
) %>%
data_color(
columns = vars(mae,mase,smape,rmse,msd),
colors = scales::col_numeric(
palette = c(
"#aa3d01","white"),
domain = NULL)
) %>% tab_footnote(
footnote = "mae = mean absolute error",
locations = cells_column_labels(columns = mae)
) %>% tab_footnote(
footnote = "mase = Mean absolute scaled error",
locations = cells_column_labels(columns = mase))%>%
tab_footnote(
footnote = "smape = Symmetric mean absolute percentage error",
locations = cells_column_labels(columns = smape))%>%
tab_footnote(
footnote = "rsme = Root square mean error",
locations = cells_column_labels(columns = rmse))%>%
tab_footnote(
footnote = "msd = Mean signed deviation",
locations = cells_column_labels(columns = msd))| in-sample accuracy metrics | |||||
|---|---|---|---|---|---|
| modelo | mae1 | mase2 | smape3 | rmse4 | msd5 |
| linear | 8.860026 | 3.2036268 | 88.08911 | 37.886125 | 1.201510e+00 |
| linearRW.1. | 8.850893 | 3.2003246 | 88.03417 | 37.875527 | 1.213427e+00 |
| AR.1.RW.1. | 8.850257 | 3.2000946 | 88.03376 | 37.876086 | 1.215069e+00 |
| spatial.effects | 2.722575 | 0.9844346 | 64.17138 | 7.235505 | -1.837196e-04 |
| iid.effects | 2.716929 | 0.9823931 | 64.08271 | 7.230243 | -3.739398e-05 |
|
1
mae = mean absolute error
2
mase = Mean absolute scaled error
3
smape = Symmetric mean absolute percentage error
4
rsme = Root square mean error
5
msd = Mean signed deviation
|
|||||
And again we can assess this performance spatially, in this case by province . So for example we can plot the spatial distribution of the mean absolute error for the models spatial effects correlated and don’t correlated
tbl.yrd.per.sf<-Peru %>%
group_by(reg,prov) %>%
summarise() %>%
inner_join(tbl.yrd.per,by=c("reg"="reg","prov"="prov"))
# map.sp<-
tbl.yrd.per.sf %>%
filter(.metric=="mae" & modelo %in%c("iid.effects","spatial.effects")) %>%
ggplot()+
geom_sf(aes(fill=.estimate),lwd=0.1) +
scale_fill_viridis(name="MAE",option="rocket",direction = -1)+
facet_wrap(vars(modelo))+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5,size = "20pts"),
strip.background =element_rect(fill="#aa3d01"),
strip.text = element_text(color = "white",face = "bold",size = 25)) In order to predict the number of death in 2019 we need to stablish this year as NA in the dataset so :
db.peru.frct <- Peru %>%
group_by(reg,prov) %>%
summarise() %>%
inner_join(db.peru.sp %>%
mutate(n=ifelse(annus == 2019 ,NA,n)),c("reg"="reg","prov"="prov"))and finally in a similar way as above, we assess the forecasted deaths spatially
peru.m8 <- inla(n ~ 1 + f(annus,model = "iid") +
f(week,model="rw1") +
f(id.sp, model = "bym",
graph=W.peru)+
temperature+ INEI_prop_aseg_2017 + prop_pobreza_2018 +
INEI_Prop_NoAlumbradoElectri_2017 + INEI_prop_NoAbsAguVvn_int_2017,
data = db.peru.frct,
control.compute = list(dic = TRUE,
waic = TRUE,
cpo = TRUE),
control.predictor = list(link = 1)
)
db.peru.frct<-db.peru.frct %>%
ungroup() %>%
mutate(fit=peru.m6$summary.fitted.values$mean) %>%
group_by(reg,prov,year) %>%
summarise(fit=mean(fit),
n=mean(n))
map.true<-db.peru.sf %>%
filter(year==2019 ) %>%
ggplot()+
geom_sf(aes(fill=n))+
theme_bw() +
#facet_wrap(vars(year)) +
scale_fill_viridis(name="Average number\n of actual deaths",direction = -1,option="rocket")+
theme(plot.title = element_text(hjust = 0.5,size = "20pts"),
strip.background =element_rect(fill="#aa3d01"),
strip.text = element_text(color = "white",face = "bold",size = 25))
map.frct<-db.peru.frct %>%
filter(year==2019 ) %>%
ggplot()+
geom_sf(aes(fill=fit))+
theme_bw() +
#facet_wrap(vars(year)) +
scale_fill_viridis(name="Average number\n of forecasted deaths",direction = -1,option="rocket")+
theme(plot.title = element_text(hjust = 0.5,size = "20pts"),
strip.background =element_rect(fill="#aa3d01"),
strip.text = element_text(color = "white",face = "bold",size = 25))
cowplot::plot_grid(map.true,map.frct)